# Historical Sync Backfill — Complete Fix Summary

## Problem

Outlook memory ingestion backfill jobs were stuck in paused/failed/cancelled states with 0 entities and 0 neural links. Workers were being killed mid-processing, and semantic extraction was failing with 401 authentication errors.

## Root Causes (9 bugs found and fixed)

| # | Bug | Symptom | Fix | Deployed |
|---|-----|---------|-----|----------|
| 1 | **aiohttp no timeout**: HTTP calls to Microsoft hung for 5 minutes | Jobs stuck running with 0/0 | 30s ClientTimeout on all Microsoft API calls | 2026-05-01 |
| 2 | **Microsoft Graph 504**: $top=1000 with 90-day range timed out Microsoft's servers | Graph API request failed | Lowered to $top=100, added 504 retry with 2s backoff | 2026-05-01 |
| 3 | **Worker self-shutdown**: worker had logic to self-stop after 5 idle minutes | Jobs pending forever, never dequeued | Removed self-shutdown logic entirely; Fly's auto_start_machines handles lifecycle | 2026-05-01 |
| 4a | **No initial heartbeat**: worker didn't send heartbeat before first API call | Reaper marked jobs as stale immediately | Send initial heartbeat before entering fetch loop | 2026-05-01 |
| 4b | **Slow chunk processing**: LanceDB + GraphRAG takes 15-30min per 100-record chunk | Reaper killed jobs as "abandoned" mid-processing | Increased reaper threshold 15→30min | 2026-05-01 |
| 5 | **SQLAlchemy session detachment**: sub-services expired job object from session | "Instance is not persistent within this Session" on chunk commit | Re-query job from DB instead of refresh() | 2026-05-01 |
| 6 | **BYOK key mismatch**: queried openai_api_key (lowercase) but stored as OPENAI_API_KEY | LLM extraction used mock-key, got 401 errors | Case-sensitive match in historical_sync_service.py + BYOKHandler fix | 2026-05-01 |
| 7 | **Email body truncated**: only 500-char preview sent to LLM | Semantic extraction missed entities in email body | Full body (up to 10KB) instead of preview | 2026-05-01 |
| 8 | **Missing asyncio import**: 504 retry logic used asyncio.sleep() | NameError: name 'asyncio' is not defined | Added import asyncio to outlook_service.py | 2026-05-01 |
| 9 | **Worker process syntax**: fly.toml used wrong command format | Worker failed to start: "No such file or directory" | Shell wrapper: worker = "sh -c '/app/docker-entrypoint.sh worker'" | 2026-05-01 |
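
Fixes 1, 2, and 8 together amount to a bounded request that retries on 504. The block below is a minimal sketch of that pattern, not the code in outlook_service.py. `fetch_page` is a hypothetical coroutine standing in for the Graph call, and the attempt limit of 3 is an assumption (the table gives only the 30s timeout and the 2s backoff). The sketch uses `asyncio.wait_for` so it runs without aiohttp installed; per the table, the service itself applies the limit through an aiohttp ClientTimeout.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional, Tuple

GATEWAY_TIMEOUT = 504


async def fetch_with_retry(
    fetch_page: Callable[[], Awaitable[Tuple[int, Any]]],  # hypothetical Graph call
    *,
    timeout_s: float = 30.0,  # the 30s limit from fix 1
    backoff_s: float = 2.0,   # the 2s backoff from fix 2
    max_attempts: int = 3,    # assumption: not stated in the source
) -> Tuple[int, Optional[Any]]:
    """Call fetch_page, retrying when it returns 504 or exceeds timeout_s."""
    for attempt in range(1, max_attempts + 1):
        try:
            status, payload = await asyncio.wait_for(fetch_page(), timeout=timeout_s)
        except asyncio.TimeoutError:
            # A hung request is handled the same way as a gateway timeout.
            status, payload = GATEWAY_TIMEOUT, None
        if status != GATEWAY_TIMEOUT:
            return status, payload
        if attempt < max_attempts:
            await asyncio.sleep(backoff_s)
    return GATEWAY_TIMEOUT, None
```

The caller receives `(504, None)` once every attempt has failed, so it can mark the chunk as failed rather than leave the job stuck in a running state.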

### Known Issues (Not Yet Fixed)

| # | Bug | Symptom | Status |
|---|-----|---------|--------|
| 10 | **Transaction errors in GraphRAG**: a database operation fails and the transaction is never rolled back | InFailedSqlTransaction ("current transaction is aborted") blocks all subsequent operations on that session | **Needs fix**: add proper error handling with rollback |
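
One possible shape for the pending fix is to wrap each unit of work in a commit-or-rollback scope, so that a failed statement cannot leave the session in an aborted transaction. This is a sketch under assumptions, not the planned implementation: `session` is any object exposing `commit()` and `rollback()` (as a SQLAlchemy Session does), and `ingest_one` and `ingest_all` are hypothetical names.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterable, Iterator, List


@contextmanager
def transaction(session: Any) -> Iterator[Any]:
    """Commit on success; roll back on any error so the session stays usable."""
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise


def ingest_all(
    session: Any,
    documents: Iterable[Any],
    ingest_one: Callable[[Any, Any], None],  # hypothetical per-document ingest
) -> List[Any]:
    """Ingest each document in its own transaction; return the ones that failed."""
    failed = []
    for doc in documents:
        try:
            with transaction(session):
                ingest_one(session, doc)
        except Exception:
            # The rollback has already happened, so the next document starts clean.
            failed.append(doc)
    return failed
```

Scoping the transaction per document trades some throughput for isolation: one bad record costs a rollback but does not block the rest of the chunk.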

## Additional Improvements

### Performance

- **Chunk size**: 1000→100 records to avoid Microsoft Graph 504 errors

- **Progress calculation**: Uses has_more flag instead of per-page total_count (fixed 150% display bug)

- **Email body**: Full content (up to 10KB) instead of 500-char bodyPreview
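
The email body change can be pictured as a small selection step: prefer the full body, fall back to the preview, and cap the result. The sketch below assumes the Microsoft Graph message shape (`body.content` and `bodyPreview`) and assumes the 10KB cap is measured in UTF-8 bytes; the source does not say how the cap is measured. `llm_input_text` is a hypothetical name.

```python
MAX_BODY_BYTES = 10 * 1024  # assumption: "10KB" means 10,240 UTF-8 bytes


def llm_input_text(message: dict, max_bytes: int = MAX_BODY_BYTES) -> str:
    """Return the full message body (capped), falling back to bodyPreview."""
    body = (message.get("body") or {}).get("content") or message.get("bodyPreview") or ""
    encoded = body.encode("utf-8")
    if len(encoded) <= max_bytes:
        return body
    # errors="ignore" drops a multi-byte character split by the cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```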

### Architecture

- **Dual ingestion paths**:
  - **Semantic memory** (LanceDB): Vector embeddings for all tenants, no API key required
  - **Structured memory** (GraphRAG): Entity extraction for tenants with BYO key

- **Worker VM**: 2GB memory (1-core) for GraphRAG ingestion headroom

- **Autoscaling**: Multi-worker (1-3 machines) via Fly Machines API

### Deployment Timeline

- **2026-05-01 00:00-01:00 UTC** - Initial fixes (timeout, 504 retry, worker shutdown)

- **2026-05-01 01:00-02:00 UTC** - Session fixes (re-query, asyncio, BYOK key)

- **2026-05-01 02:00-03:00 UTC** - Email body + worker process fixes

- **2026-05-01 03:00-04:00 UTC** - Reaper timeout + dual-memory architecture

## Architecture

```text
Email record (100 per chunk)
    │
    ├─→ _extract_structured_entities() → DiscoveredEntity → Postgres
    │   └─→ Rule-based: from, to, subject, content fields
    │
    ├─→ graphrag.ingest_document() → GraphNode/GraphEdge → Postgres
    │   └─→ LLM-based: people, orgs, topics from email body
    │   └─→ Requires: Tenant BYOK key (OPENAI_API_KEY)
    │
    └─→ lancedb.add_document() × 100 → LanceDB
        └─→ Full text + subject → Vector embeddings
        └─→ Requires: None (runs for all tenants)
```
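
The three branches above can be sketched as a per-chunk dispatcher in which only the GraphRAG branch is gated on the tenant key. This is an illustration under assumptions: the three callables are hypothetical stand-ins for `_extract_structured_entities()`, `graphrag.ingest_document()`, and `lancedb.add_document()`, and calling GraphRAG once per record (rather than once per chunk) is a guess.

```python
from typing import Any, Callable, Dict, Iterable, Optional


def process_chunk(
    records: Iterable[Any],
    *,
    extract_structured: Callable[[Any], None],    # rule-based path, always runs
    graphrag_ingest: Callable[[Any, str], None],  # LLM path, needs a tenant key
    lancedb_add: Callable[[Any], None],           # embedding path, always runs
    byok_key: Optional[str],
) -> Dict[str, int]:
    """Route each record through the ingestion paths and count calls per path."""
    counts = {"structured": 0, "graphrag": 0, "lancedb": 0}
    for record in records:
        extract_structured(record)
        counts["structured"] += 1
        if byok_key:
            # Skipped entirely for tenants without a key, instead of
            # calling the LLM with a placeholder and getting a 401.
            graphrag_ingest(record, byok_key)
            counts["graphrag"] += 1
        lancedb_add(record)
        counts["lancedb"] += 1
    return counts
```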

## Deployment Commands

### Deploy Latest Changes

```bash
git pull origin main
fly deploy -a atom-saas --strategy immediate
```

### Verify Deployment

```bash
# Check machines started
fly status -a atom-saas

# Check worker logs
fly logs -a atom-saas --machine 2861d27a3414e8 | grep -E "(ROLE|Starting|Loaded service)"
```

## Verification

### Check Recent Jobs

```sql
SELECT
    id,
    status,
    records_processed,
    entities_extracted,
    relationships_extracted,
    created_at,
    completed_at
FROM historical_sync_jobs
WHERE created_at > NOW() - INTERVAL '1 hour'
ORDER BY created_at DESC
LIMIT 5;
```

### Monitor Worker Activity

```bash
# Real-time worker logs
fly logs -a atom-saas --machine 2861d27a3414e8 | grep -E "(dequeue|Fetched|Persisting|entities|relationships|LanceDB)"

# Check for errors
fly logs -a atom-saas --machine 2861d27a3414e8 | grep -E "(error|Error|ERROR|failed|Failed)"
```

### Verify Semantic Extraction

```sql
-- Check LanceDB documents stored
-- (Requires access to production database)

-- Check GraphRAG entities created
SELECT COUNT(*) FROM graph_nodes
WHERE tenant_id = '31c06fc4-db22-4740-83ea-48ac14f25810'
  AND created_at > NOW() - INTERVAL '1 hour';
```

## Expected Performance

Per 100-record chunk:

- **Fetch from Outlook**: ~2-5 seconds

- **LanceDB embeddings**: ~30-60 seconds (synchronous, one-at-a-time)

- **GraphRAG entity extraction**: ~5-15 minutes (if BYOK key available)

- **Total time**: ~8-10 minutes (LanceDB only) or ~15-20 minutes (both paths)

**Note**: LanceDB calls are currently synchronous (not batched with asyncio.gather()). This is a future optimization opportunity.

## Current Status

### ✅ Working

- Worker stays alive and processes jobs continuously

- Fetches 100 emails per chunk without 504 errors

- Full email body (up to 10KB) available for processing

- No premature reaper kills (30-minute threshold)

- LanceDB embeddings created for all tenants

- BYOK tenants get GraphRAG entity extraction

### ⏳ Pending

- **Transaction error fix** needed for GraphRAG path

- **LanceDB batching** optimization (future)

- Production backfill completion metrics

### 🔧 Known Limitations

- GraphRAG path blocked by transaction errors (Bug #10)

- Single-core worker limits concurrent processing

- Synchronous LanceDB calls add ~30-60 seconds per chunk

## Next Steps

1. **Fix transaction errors** in GraphRAG ingestion path

2. **Verify production backfill** completes successfully

3. **Optimize LanceDB** with batched concurrent calls

4. **Monitor reaper** to ensure 30-minute threshold is sufficient
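
For step 4, the reaper's decision reduces to comparing the last heartbeat against the threshold. The sketch below is illustrative only: `is_abandoned` is a hypothetical name, and treating a job with no heartbeat as abandoned is an assumption inferred from bug 4a, not confirmed reaper behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

REAPER_THRESHOLD = timedelta(minutes=30)  # raised from 15 minutes in fix 4b


def is_abandoned(
    last_heartbeat: Optional[datetime],
    now: datetime,
    threshold: timedelta = REAPER_THRESHOLD,
) -> bool:
    """Return True when a running job has gone longer than threshold without a heartbeat."""
    if last_heartbeat is None:
        # Assumption: no heartbeat at all counts as stale, which is why
        # the worker now sends one before entering the fetch loop (fix 4a).
        return True
    return now - last_heartbeat > threshold
```

Logging `now - last_heartbeat` for every running job, not only the reaped ones, would show how close normal chunks come to the 30-minute limit.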